This tutorial is based on two sources:

  1. https://hbctraining.github.io/Training-modules/IntroR/ by Meeta Mistry, Mary Piper, and Radhika Khetani

  2. https://rspatial.org/raster/sdm/index.html by Robert J. Hijmans and Jane Elith.

Part 1. How to use this tutorial document

This file should have opened in a web browser window. It doesn’t run anything in R by itself; instead you will need to copy and paste (or retype) commands from it.

Whenever you see something like this,

print("hello world") # comments look like this; you don't have to copy them
## [1] "hello world"

the tutorial will display two code boxes.

  • The box without the two hash characters (##) contains the command, which is the text that you will run in R. To run something in R, simply copy the text in the gray codebox into your console and press Return.

    • Note: Inside the command, anything that starts with a single # and is displayed in grey is just a comment; you don’t have to run it, but it doesn’t hurt anything if you copy and paste it into R.
  • The box with the ## is the result, which should correspond to what will be printed in your console window.

Part 2. Getting started with RStudio

Find the file “installscript.R” in the class network directory where you found this tutorial, and choose RStudio to open it in the RStudio program. RStudio is a development environment for R, which means it provides a graphical interface for writing code in the R programming language.

The RStudio interface has four main panels:

  • Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio. It’s the green box in the image below.

  • Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console. Right now it’s showing “installscript.R”. It’s the red box in the image below.

  • Environment/History: Environment shows all active objects and History keeps track of all commands run in the console. It’s the blue box in the image below.

  • Files/Plots/Packages/Help: several different tabs that show the active directory, plots, installed packages (more about this later), help files, etc. It’s the yellow box in the image below.

Screenshot of the RStudio panels

First you’re going to use the script editor (top left panel). Click “Source” in the top right of this panel to run the “installscript.R” script, which will install all the libraries you’ll need for this course. While it’s running, look at the other panels too.

Next find the console (the bottom left panel). When you type something into this command-line interface and hit Enter, the text you entered will be run in the R processor and the results will be returned. Right now you’ll see a bunch of text scrolling by as it installs the packages.

The command prompt

Knowing how to read the command prompt helps you tell when R is ready to accept commands. The list below describes the different states of the command prompt and how you can exit a command:

  • Console is ready to accept commands: >

If R is ready to accept commands, the R console shows a > prompt. Can you find this on your own screen?

When the console receives a command (either typed directly into the console or sent from the script editor with Ctrl+Enter), R will try to execute it. After running, the console will show the results and come back with a new > prompt to wait for new commands.

  • Console is waiting for you to enter more data: +

If R is still waiting for you to enter more data because the code sent to the console isn’t a complete command yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. Often this can be due to you having not ‘closed’ a parenthesis or quotation.

If you can’t figure out why your command isn’t running, you can click inside the console window and press the Escape key to escape the command and bring back a new prompt >; then you can start over sending the command.

Using the command prompt

Once your libraries are finished installing and it shows a command prompt, run the following command by typing or pasting it into the console and hitting Enter:

getwd()
## [1] "/Users/jblois/Documents/GitHub/biodata_shortcourse/development"

This shows your current working directory, that is, where you currently are in the computer’s file structure. (It will NOT look like the result above.) If you look in the “Files” tab in the bottom right panel, you will see all the files in this directory, which you can also list using the following command:

dir()
##  [1] "biodata_BobcatSTEM.Rproj" "climate"                 
##  [3] "course_overview.html"     "course_overview.Rmd"     
##  [5] "data-cleaning.R"          "day1_tutorial.html"      
##  [7] "day1_tutorial.Rmd"        "day2_tutorial.Rmd"       
##  [9] "day3_tutorial.Rmd"        "fix-paleoclimate.R"      
## [11] "gbif-download.R"          "images"                  
## [13] "neotoma-download.R"       "neotoma-raw.RData"

Try some other stuff to see how this works.

9+6 #you can just use it as a calculator
## [1] 15
sum(9,6) #you can also use functions instead of arithmetic symbols
## [1] 15

Now, try something deliberately wrong. Copy and paste this line of code into your console:

9+6+ 

If you look at your console, you will see that instead of an answer (15), you see a + underneath the line of code that says 9+6+. R is waiting for the rest of the expression. To complete it, type a 0 after the + and press Enter. You have now ‘closed’ the line of code and gotten your answer.

Remember, you can always click inside the console window and press the Escape key to escape the command and bring back a new prompt >; then you can start over sending the command.

The script editor

Now try the script editor (top left window in RStudio). In your “installscript.R” window, paste in the following:

# I am adding 3 and 5!
3 + 5
## [1] 8

It didn’t run just because you wrote it in there. Highlight the pasted text and hit Ctrl+Enter (or click Run in the top right corner of the pane): the highlighted text will be sent to the console and your result will appear.

This is useful for when you need to run the same command multiple times, such as when you’re trying to get something right – that’s why it’s called the “editor”. You should make a habit of writing your commands in the code editor instead of the console, because then you can easily check them later to see exactly how you did it.

Syntax

Notice that the English comment in there started with the comment symbol, #. What happens if we run that same command without the #? Remove the # sign from the front of the comment and re-run the two lines:

I am adding 3 and 5!
3 + 5

Now R is trying to run that sentence as a command, and it doesn’t work. We get an error in the console, “Error: unexpected symbol in "I am"”, which means that the R interpreter did not know what to do with that command. Things sent to the console won’t work unless they are properly constructed commands in the R language.

Use the # character to insert comments about what your code is doing. This, again, makes it easier to understand your own work later.

Assignment operator

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to a variable named x by running:

x  <-  3

The assignment operator (<-) assigns values on the right to variables on the left.

Variables

A variable in computer programming is a symbolic name for a location where information can be maintained and referenced. You can think of a variable like a “bucket” of information with a label on the outside. When referring to the bucket of information, we use the label on the bucket (the variable name), not the data stored in the bucket (the value).

In the example above, we created a variable or a ‘bucket’ called x. Inside we put a value, 3.

Let’s create another variable called y and give it a value of 5.

y  <-  5

When you assign a value to a variable, R does not print anything to the console. You can tell it to print the value by typing the variable name:

y
## [1] 5

You can also view information on all the currently stored variables by looking in your Environment window in the upper right-hand corner of the RStudio interface.

Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation?

x+y
## [1] 8

Try assigning the results of this operation to another variable called number.

number  <-  x + y
number
## [1] 8

Questions:

  1. Change the value of the variable x to 5 using the assignment operator. What happens to number? Does it change?
  2. Now try changing the value of variable y to contain the value 10. What do you need to do to update the variable number to the new value of x + y? Show your results to an instructor.

Tips on variable names

Variables can be given almost any name, such as x, current_temperature, or subjectID. However, there are some rules / suggestions you should keep in mind:

  • R is case sensitive (e.g., X is different from x)
  • Variable names can’t start with a number (2x is not valid but x2 is)
  • You can’t use R’s reserved words (e.g., if, else, for) as variable names. In general, even if it’s allowed, it’s best not to reuse names that R already defines (e.g., c, T, mean, data).
    • You can type ? followed by the name to see if the name is already in use by a built-in function.
  • Use short variable names; longer names = more typos.
  • Before you assign a new variable, check in the Environment tab to make sure you didn’t already use the name.
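A few of these rules can be checked directly in the console. The lines below are only a small sketch: the variable names and values are made up for illustration, and exists() is a base R function that this tutorial has not introduced; it reports whether a name is already in use.

```r
# R is case sensitive: x and X are two separate variables
x <- 3
X <- 10
x + X
## [1] 13

# A longer, descriptive name (the value is hypothetical, for illustration only)
current_temperature <- 21.5
current_temperature
## [1] 21.5

# exists() returns TRUE if a name is already taken, e.g. by a built-in function
mean_is_taken <- exists("mean")
mean_is_taken
## [1] TRUE
```

Because mean is already the name of a built-in function, it would be a poor choice for one of your own variables.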

Data Storage

Data Types

Variables can contain values of specific types within R. The most common basic data types in R include:

  • "numeric" for any numerical value
  • "character" for text values, denoted by using quotes ("") around the value
  • "logical" for TRUE and FALSE (the boolean data type)

The table below provides examples of each of the commonly used data types:

Data Type    Examples
Numeric      1, 1.5, 20, pi
Character    “anytext”, “5”, “TRUE”
Logical      TRUE, FALSE, T, F
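If you are ever unsure which data type a value has, you can ask R. The sketch below uses class(), a base R function that is not otherwise covered in this tutorial; the variable names are made up for illustration.

```r
# class() reports the data type of a value
num_type <- class(1.5)
num_type
## [1] "numeric"

# Quotes make the difference: "5" is text, not a number
chr_type <- class("5")
chr_type
## [1] "character"

lgl_type <- class(TRUE)
lgl_type
## [1] "logical"

# "TRUE" in quotes is text too, not a logical value
quoted_type <- class("TRUE")
quoted_type
## [1] "character"
```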

Data Structures

We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when number was created, the result of the mathematical operation was a single value. However, variables can store more than just a single value: they can hold a variety of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).

Vector

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It can be constructed with the combine command, c(). It’s basically just a collection of values, mainly either numbers,

c(1, 40, 9, 22)
## [1]  1 40  9 22

or characters,

c("a", "b", "c", "q")
## [1] "a" "b" "c" "q"

or logical values.

c(TRUE, TRUE, FALSE, TRUE)
## [1]  TRUE  TRUE FALSE  TRUE

Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type. For example, if you were to try to create the following vector:

c("a", 9, 12, TRUE)
## [1] "a"    "9"    "12"   "TRUE"

R turns every value into a character string, as the quotation marks in the result show. This is called forcing (“coercing”) the values to a single data type.
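You can confirm the coercion yourself. The following is a small sketch using class(), a base R function not introduced elsewhere in this tutorial; the variable names are hypothetical.

```r
# Mixing text with numbers and logicals gives a character vector
mixed <- c("a", 9, 12, TRUE)
class(mixed)
## [1] "character"

# Mixing numbers with logicals gives a numeric vector:
# TRUE becomes 1 and FALSE becomes 0
num_and_logical <- c(5, TRUE, FALSE)
num_and_logical
## [1] 5 1 0
class(num_and_logical)
## [1] "numeric"
```

This is worth remembering when you type in data by hand: a single stray text value turns a whole column of numbers into text.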

The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements. Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single bucket.

Let’s create a vector of specimen counts and assign it to a variable called specCounts. Run the following lines:

specCounts  <-  c(3000, 50000, 46)
specCounts
## [1]  3000 50000    46

Each element of this vector contains a single numeric value, and the three values are combined into a vector using c() (the combine function). All of the values are put within the parentheses and separated by commas.

Looking in your Environment tab, you can see that the specCounts variable you just created is numeric, starts at element 1 and ends at element 3 (i.e. it’s a vector containing 3 numeric values).

A vector can also contain characters. Run the following code to create another vector called species with three elements, where each element corresponds with the previous vector.

species <- c("crocodile", "trout", "panda")
species
## [1] "crocodile" "trout"     "panda"

Matrix

A matrix in R is a collection of vectors of the same length and type. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure.

Matrices are used commonly as part of the mathematical machinery of statistics. We don’t create these manually very often, but they’re very commonly used inside R functions. They are usually of numeric data type. Every element of a matrix must be of the same data type, so if the input data is of mixed types (numeric, character, etc.), R coerces all the values to a single type, just as it does for a vector, rather than stopping with an error.
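Although we rarely build matrices by hand, it helps to see one. This is a minimal sketch using the base R functions matrix(), cbind(), and rbind(); the vectors heights and widths are made-up values for illustration only.

```r
# Fill a matrix with the numbers 1 to 6, arranged in 2 rows
# (R fills column by column by default)
m <- matrix(1:6, nrow = 2)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# Two hypothetical vectors of the same length and type
heights <- c(1.2, 3.4, 0.8)
widths  <- c(0.5, 1.1, 0.3)

# Combine them as columns (3 rows, 2 columns) ...
by_column <- cbind(heights, widths)
by_column

# ... or as rows (2 rows, 3 columns)
by_row <- rbind(heights, widths)
by_row
```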

Data Frame

A data.frame is the most common data structure in R for storing data in tables, and it’s what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a data frame each vector can be of a different data type (e.g., characters, integers, factors).

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function. We give the function the different vectors we would like to bind together, and it creates the data frame. This function will only work for vectors of the same length.

df <- data.frame(species,specCounts)
df
##     species specCounts
## 1 crocodile       3000
## 2     trout      50000
## 3     panda         46

You can see that there are two columns, each one containing one of the input vectors.
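To see why the lengths matter, here is a small sketch of what happens when they do not match. The vector tooShort is hypothetical, and tryCatch() is a base R function (not covered in this tutorial) used here only so that the error message is stored instead of stopping the code.

```r
species    <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)

# Same length: this works, giving 3 rows and 2 columns
df <- data.frame(species, specCounts)
nrow(df)
## [1] 3
ncol(df)
## [1] 2

# A hypothetical vector with only two values
tooShort <- c(1, 2)

# Three species but only two values: data.frame() stops with an error.
# tryCatch() captures the error message as text so we can look at it.
result <- tryCatch(
  data.frame(species, tooShort),
  error = function(e) conditionMessage(e)
)
result
```

The message stored in result describes the mismatch in the number of rows.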

List

Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures, one after another.

If you have variables of different data structures you wish to combine, you can put all of those into one list object by using the list() function and placing all the items you wish to combine within parentheses.

Run the following to construct a list called “list1” that contains three of the variables we’ve created so far in this tutorial.

list1 <- list(number, species, specCounts)
list1
## [[1]]
## [1] 8
## 
## [[2]]
## [1] "crocodile" "trout"     "panda"    
## 
## [[3]]
## [1]  3000 50000    46

There are three components corresponding to the three different variables we passed in, and you can see that the structure of each is retained.
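The [[1]], [[2]], and [[3]] labels in the printed list are also how you can pull a single component back out. This goes slightly beyond what the tutorial covers, so treat it as a sketch; the variables are recreated here so that the example runs on its own.

```r
number     <- 8
species    <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)
list1 <- list(number, species, specCounts)

# How many components does the list hold?
length(list1)
## [1] 3

# Double square brackets extract one component, here the species vector
second <- list1[[2]]
second

# Each component keeps its own data type
class(list1[[2]])
## [1] "character"
class(list1[[3]])
## [1] "numeric"
```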

Functions

A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.

The general usage for a function is the name of the function followed by parentheses:

function_name(input)

The input(s) are called arguments, which can include:

  1. the data structure or data structures on which the function operates
  2. specifications that alter the way the function operates

Not all functions take arguments, for example:

getwd()

However, most functions take one or more arguments. If you don’t specify a required argument when calling the function, you will receive an error. Other arguments are optional: if you don’t include them, the function will fall back on using a default. The defaults represent standard values that the author of the function specified as being “good enough in standard cases”, but if you want something specific, simply change the argument yourself with a value of your choice.
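As a rough illustration of defaults and required arguments, the sketch below uses two base R functions: paste(), which is not otherwise covered in this tutorial, and sqrt(). The tryCatch() wrapper is only there so the error does not stop the code.

```r
# paste() joins text values; its optional sep argument defaults to a space
default_sep <- paste("Iris", "setosa")
default_sep
## [1] "Iris setosa"

# Supplying sep yourself overrides the default
custom_sep <- paste("Iris", "setosa", sep = "_")
custom_sep
## [1] "Iris_setosa"

# sqrt() needs a number to work on; calling it with nothing is an error.
# tryCatch() stores the error message as text instead of stopping.
missing_arg <- tryCatch(sqrt(), error = function(e) conditionMessage(e))
missing_arg
```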

Basic functions

We have already used a few examples of basic functions in the previous lessons, e.g., getwd(), c(), and data.frame(). These functions are available as part of R’s built-in capabilities, and we will explore a few more of these base functions below.

You can also get functions from external packages or libraries, or even write your own.

Let’s revisit the function c() that we have used previously to combine data into vectors. The arguments it takes are a collection of numbers, characters or strings (separated by a comma). The c() function performs the task of combining all the numbers or characters provided as arguments into a single vector. You can also pass an existing vector as one of the arguments in order to add elements to it:

specCountsLonger <- c(900,specCounts) #adds the new value at the beginning 
#or
specCountsLonger <- c(specCounts,900) #adds the new value at the end

What happens here is that we take the original vector specCounts (containing three elements), and add another item to one end. You can imagine doing this over and over again to build a vector.

Since R is used for statistical computing, many of the base functions involve mathematical operations. If interested, we have linked a detailed guide for performing basic statistical tests in R. One example of a base R mathematical function would be sqrt(). The input/argument must be a number, and the output is the square root of that number. Let’s try finding the square root of 81:

sqrt(81)
## [1] 9

Now what would happen if we called the function (i.e., ran the function) on a vector of values instead of a single value?

sqrt(specCounts)
## [1]  54.77226 223.60680   6.78233

In this case the function was called on each individual value of the vector specCounts and the respective results were displayed. Beware: this does not work with every function!
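To make the contrast concrete, here is a small sketch comparing sqrt() with two base R functions that treat the vector as a whole, length() and max(). Neither is covered elsewhere in this tutorial, and they are only two examples of functions that return one value for the whole vector.

```r
specCounts <- c(3000, 50000, 46)

# sqrt() works element by element: three values in, three values out
roots <- sqrt(specCounts)
length(roots)
## [1] 3

# length() and max() summarize the whole vector: three values in, one value out
n_values <- length(specCounts)
n_values
## [1] 3
largest <- max(specCounts)
largest
## [1] 50000
```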

Let’s try another function, this time one with options (arguments that change the behavior of the function) that we can adjust, for example round:

round(3.14159)
## [1] 3

We can see that we get 3. That’s because the default is to round to the nearest whole number. What if we want a different number of decimal places?

Seeking help on arguments for functions

The best way of finding out this information is to use the help operator ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio that will provide a description of the function, usage, arguments, details, and examples:

?round

You can also use the example() function to run the examples from the help file. (This one has a lot of examples!)

example(round)
## 
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2  0  0  2  2  4  4
## [1] -2  0  0  2  2  4  4
## 
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round> 
## round> ( x1 <- seq(-2, 4, by = .5) )
##  [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0
## 
## round> round(x1) #-- IEEE / IEC rounding !
##  [1] -2 -2 -1  0  0  0  1  2  2  2  3  4  4
## 
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
## 
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5  0.5  2.5
## 
## round> (non.int <- ceiling(x1) != floor(x1))
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [13] FALSE
## 
## round> x2 <- pi * 100^(-1:3)
## 
## round> round(x2, 3)
## [1]       0.031       3.142     314.159   31415.927 3141592.654
## 
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06

If you are already familiar with the function but just need to remind yourself of the names of the arguments, you can use:

str(round)
## function (x, digits = 0, ...)

This tells us that we can change the number of digits returned by adding an optional argument. We can type digits = 2 or however many we may want:

round(3.14159, digits = 2)
## [1] 3.14
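A few variations are sketched below. They assume only what str(round) and the help file examples show: digits is the second argument of round(), and signif() is a related function that counts significant digits rather than decimal places.

```r
# Arguments can be matched by position: the second one is digits
by_position <- round(3.14159, 2)
by_position
## [1] 3.14

# A negative digits value rounds to the left of the decimal point
to_hundreds <- round(1234.567, digits = -2)
to_hundreds
## [1] 1200

# signif() keeps a number of significant digits instead
three_sig <- signif(1234.567, digits = 3)
three_sig
## [1] 1230
```

Naming the argument (digits = 2) is a little more typing than relying on position, but it makes the code easier to read later.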

Question:

Another commonly used base function is mean(). Use this function to calculate an average for the specCounts vector, and show your result to the instructor. (If you look at the help file, you will see that the arguments for the mean() function are supplied in a different data structure than the other functions we’ve seen so far.)


Data

The last thing we’re going to cover in this introduction is how to inspect data.

Selecting data using indexes and sequences

When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors and how to access different elements, and then extend those concepts to dataframes.

Vectors

If we want to extract one or several values from a vector, we must provide one or several indexes using square brackets [ ] syntax. The index represents the location of the element within a vector (or the compartment number, if you think of the bucket analogy). R indexes start at 1.

Let’s start by creating a vector called age:

age  <-  c(15, 22, 45, 52, 73, 81)

If we only wanted the fifth value of this vector, we would use the following syntax:

age[5]
## [1] 73

If we wanted all values except the fifth value of this vector, we would use the following:

age[-5]
## [1] 15 22 45 52 81

If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:

idx  <-  c(3,5,6) # create vector of the elements of interest
age[idx]
## [1] 45 73 81

To select a sequence of consecutive values from a vector, we would use : which is a special operator that creates numeric vectors of integers in increasing or decreasing order. Let’s select the first four values from age:

age[1:4]
## [1] 15 22 45 52

Try reversing that to say 4:1 and see what happens!

Selection of values can also be performed using logical expressions. Logical operators include greater than (>), less than (<), and equal to (==). We can use logical expressions to determine whether a particular condition is true or false. Then, subset out the TRUE values:

age[age > 50]
## [1] 52 73 81
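It can help to look at the logical expression on its own before using it as an index. The sketch below does that, and also uses two things not introduced elsewhere in this tutorial: the base R function which() and the & ("and") operator. The variable name over50 is made up for illustration.

```r
age <- c(15, 22, 45, 52, 73, 81)

# The comparison alone gives one TRUE or FALSE per element
over50 <- age > 50
over50
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Using that logical vector as the index keeps only the TRUE positions
age[over50]
## [1] 52 73 81

# which() returns the positions of the TRUE values
which(over50)
## [1] 4 5 6

# Two conditions can be combined with & ("and")
age[age > 20 & age < 80]
## [1] 22 45 52 73
```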

More details about using logical expressions to subset data can be found here

Dataframes

We’re going to use the built-in data set called iris. This single dataframe contains the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width, respectively, for 50 flowers from each of 3 species of iris, a total of 150 specimens. The species are Iris setosa, I. versicolor, and I. virginica.

Inspecting data frames

This is a small dataframe, so you can just look at it in the console first.

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Or check how many rows and columns it has with dim():

dim(iris)
## [1] 150   5
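
If you only need one of the two numbers, base R also has nrow() and ncol(). A minimal sketch on the same iris data (the variable name iris_dims is just for illustration):

```r
# dim() returns a vector of length 2: rows first, then columns
iris_dims <- dim(iris)
iris_dims[1]   # number of rows
## [1] 150
iris_dims[2]   # number of columns
## [1] 5

# nrow() and ncol() return each number directly
nrow(iris)
## [1] 150
ncol(iris)
## [1] 5
```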

However, 150 lines is still a little inconvenient if you just want to see what the data in each column are generally like. Try this:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now you see just the first 6 lines, as well as the header. Each row holds information for a single specimen, and the columns contain information about the specimen’s measurements and species. What data type is each column? Check using str(), which we used before to inspect the arguments of a function. When you call it on a variable, it tells you about the data structure and types.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Uh-oh, what’s a factor data type? It is how R stores categorical data: it behaves like a character field whose values must come from a restricted set of possible values called “levels”. Don’t worry about that for now.
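
If you are curious anyway, here is a small sketch of how a factor behaves. The vector sizes is made up for illustration and is not part of the iris data.

```r
# A made-up character vector (hypothetical example data)
sizes <- c("small", "large", "small", "medium")

# Converting it to a factor records the set of allowed values (the levels)
sizes_factor <- factor(sizes)
levels(sizes_factor)   # levels are sorted alphabetically by default
## [1] "large"  "medium" "small"

# Behind the scenes, each value is stored as an integer pointing at a level
as.integer(sizes_factor)
## [1] 3 1 3 2

# The Species column of iris works the same way
nlevels(iris$Species)
## [1] 3
```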

(You can also look at this in a separate tab in RStudio. Choose “package:datasets” from the dropdown that currently says “Global Environment”. Then click on iris in the Environment tab to open the data table in a new tab in the same pane as the script editor.)

Selecting data from dataframes

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select specific data from them we need to specify an index for each dimension. We use the same square bracket notation, but rather than providing a single index, we provide two. Within the square brackets, row numbers come first, followed by column numbers, and the two are separated by a comma.

iris[1, 1]   # element from the first row in the first column of the data frame
## [1] 5.1
iris[1, 3]   # element from the first row in the 3rd column
## [1] 1.4

To select whole rows, you provide only the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:

iris[3, ]    #returns a one-row dataframe containing all elements in the 3rd row
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3          4.7         3.2          1.3         0.2  setosa

To select specific columns from the data frame instead, leave the row index blank:

iris[ , 3]    #returns a vector containing all elements in the 3rd column
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
##  [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
##  [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
##  [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
##  [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
##  [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
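
Selecting a single row and selecting a single column give back different kinds of object, and you can check this by asking R for the class of each result. The sketch below uses only base R. The variable names are illustrative, and the drop = FALSE argument is an optional extra that this tutorial does not otherwise rely on.

```r
# A single row keeps its column structure, so it is still a data frame
row3 <- iris[3, ]
class(row3)
## [1] "data.frame"

# A single column is simplified to a plain vector
col3 <- iris[, 3]
class(col3)
## [1] "numeric"

# Adding drop = FALSE keeps a single column as a one-column data frame
col3_df <- iris[, 3, drop = FALSE]
class(col3_df)
## [1] "data.frame"
dim(col3_df)
## [1] 150   1
```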

Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values:

iris[ , 1:2] #returns a dataframe containing first two columns
##     Sepal.Length Sepal.Width
## 1            5.1         3.5
## 2            4.9         3.0
## 3            4.7         3.2
## 4            4.6         3.1
## 5            5.0         3.6
## 6            5.4         3.9
## 7            4.6         3.4
## 8            5.0         3.4
## 9            4.4         2.9
## 10           4.9         3.1
## 11           5.4         3.7
## 12           4.8         3.4
## 13           4.8         3.0
## 14           4.3         3.0
## 15           5.8         4.0
## 16           5.7         4.4
## 17           5.4         3.9
## 18           5.1         3.5
## 19           5.7         3.8
## 20           5.1         3.8
## 21           5.4         3.4
## 22           5.1         3.7
## 23           4.6         3.6
## 24           5.1         3.3
## 25           4.8         3.4
## 26           5.0         3.0
## 27           5.0         3.4
## 28           5.2         3.5
## 29           5.2         3.4
## 30           4.7         3.2
## 31           4.8         3.1
## 32           5.4         3.4
## 33           5.2         4.1
## 34           5.5         4.2
## 35           4.9         3.1
## 36           5.0         3.2
## 37           5.5         3.5
## 38           4.9         3.6
## 39           4.4         3.0
## 40           5.1         3.4
## 41           5.0         3.5
## 42           4.5         2.3
## 43           4.4         3.2
## 44           5.0         3.5
## 45           5.1         3.8
## 46           4.8         3.0
## 47           5.1         3.8
## 48           4.6         3.2
## 49           5.3         3.7
## 50           5.0         3.3
## 51           7.0         3.2
## 52           6.4         3.2
## 53           6.9         3.1
## 54           5.5         2.3
## 55           6.5         2.8
## 56           5.7         2.8
## 57           6.3         3.3
## 58           4.9         2.4
## 59           6.6         2.9
## 60           5.2         2.7
## 61           5.0         2.0
## 62           5.9         3.0
## 63           6.0         2.2
## 64           6.1         2.9
## 65           5.6         2.9
## 66           6.7         3.1
## 67           5.6         3.0
## 68           5.8         2.7
## 69           6.2         2.2
## 70           5.6         2.5
## 71           5.9         3.2
## 72           6.1         2.8
## 73           6.3         2.5
## 74           6.1         2.8
## 75           6.4         2.9
## 76           6.6         3.0
## 77           6.8         2.8
## 78           6.7         3.0
## 79           6.0         2.9
## 80           5.7         2.6
## 81           5.5         2.4
## 82           5.5         2.4
## 83           5.8         2.7
## 84           6.0         2.7
## 85           5.4         3.0
## 86           6.0         3.4
## 87           6.7         3.1
## 88           6.3         2.3
## 89           5.6         3.0
## 90           5.5         2.5
## 91           5.5         2.6
## 92           6.1         3.0
## 93           5.8         2.6
## 94           5.0         2.3
## 95           5.6         2.7
## 96           5.7         3.0
## 97           5.7         2.9
## 98           6.2         2.9
## 99           5.1         2.5
## 100          5.7         2.8
## 101          6.3         3.3
## 102          5.8         2.7
## 103          7.1         3.0
## 104          6.3         2.9
## 105          6.5         3.0
## 106          7.6         3.0
## 107          4.9         2.5
## 108          7.3         2.9
## 109          6.7         2.5
## 110          7.2         3.6
## 111          6.5         3.2
## 112          6.4         2.7
## 113          6.8         3.0
## 114          5.7         2.5
## 115          5.8         2.8
## 116          6.4         3.2
## 117          6.5         3.0
## 118          7.7         3.8
## 119          7.7         2.6
## 120          6.0         2.2
## 121          6.9         3.2
## 122          5.6         2.8
## 123          7.7         2.8
## 124          6.3         2.7
## 125          6.7         3.3
## 126          7.2         3.2
## 127          6.2         2.8
## 128          6.1         3.0
## 129          6.4         2.8
## 130          7.2         3.0
## 131          7.4         2.8
## 132          7.9         3.8
## 133          6.4         2.8
## 134          6.3         2.8
## 135          6.1         2.6
## 136          7.7         3.0
## 137          6.3         3.4
## 138          6.4         3.1
## 139          6.0         3.0
## 140          6.9         3.1
## 141          6.7         3.1
## 142          6.9         3.1
## 143          5.8         2.7
## 144          6.8         3.2
## 145          6.7         3.3
## 146          6.7         3.0
## 147          6.3         2.5
## 148          6.5         3.0
## 149          6.2         3.4
## 150          5.9         3.0
iris[c(1,3,6), ] #returns a dataframe containing first, third and sixth rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
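
The row and column vectors can also be combined in a single call. A minimal sketch, reusing the same rows and columns as above (the variable name picked is just for illustration):

```r
# Rows 1, 3 and 6, but only the first two columns
picked <- iris[c(1, 3, 6), 1:2]
picked
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 3          4.7         3.2
## 6          5.4         3.9
```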

For larger datasets, it can be tricky to remember which column number corresponds to a particular variable. In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable; doing so also makes your code easier to read and your intentions clearer.

iris[1:3 , "Petal.Length"] # values of the Petal.Length column from the first three samples
## [1] 1.4 1.4 1.3
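
To see why names are safer than numbers, the sketch below builds a copy of iris with its columns in a different order. The data frame shuffled is hypothetical and exists only for this illustration.

```r
# A copy of iris with the columns rearranged (illustration only)
shuffled <- iris[, c("Species", "Petal.Length", "Sepal.Length",
                     "Sepal.Width", "Petal.Width")]

# Column 3 now means something different in the two data frames ...
iris[1:3, 3]
## [1] 1.4 1.4 1.3
shuffled[1:3, 3]
## [1] 5.1 4.9 4.7

# ... but the column name still finds the same values
shuffled[1:3, "Petal.Length"]
## [1] 1.4 1.4 1.3
```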

You can do operations on a particular column by selecting it with the $ sign. In this case, the entire column is returned as a vector. For instance, to extract all the species names from our dataset, we can use:

iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

You can use names() or colnames() to remind yourself of the column names. Because a column selected with $ is a vector, we can then supply index values in square brackets to select specific values from it. For example, if we wanted the petal widths for the first five samples in iris:

colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
iris$Petal.Width[1:5]
## [1] 0.2 0.2 0.2 0.2 0.2
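
Double square brackets are another base R way to pull out a single column as a vector. They are handy when the column name is stored in a variable, which $ cannot use. In this sketch the variable name wanted is a hypothetical choice.

```r
# $ and [[ ]] return the same vector
identical(iris$Petal.Width, iris[["Petal.Width"]])
## [1] TRUE

# [[ ]] also works when the column name is stored in a variable
wanted <- "Petal.Width"   # hypothetical variable name
first_five <- iris[[wanted]][1:5]
first_five
## [1] 0.2 0.2 0.2 0.2 0.2
```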

The $ allows you to select a single column by name. The result is a one-dimensional vector, so indexing it requires only one index and no commas. To select multiple columns by name, make a vector of strings that correspond to column names and supply it as the column index of the dataframe:

iris[, c("Petal.Length", "Petal.Width")]
##     Petal.Length Petal.Width
## 1            1.4         0.2
## 2            1.4         0.2
## 3            1.3         0.2
## 4            1.5         0.2
## 5            1.4         0.2
## 6            1.7         0.4
## 7            1.4         0.3
## 8            1.5         0.2
## 9            1.4         0.2
## 10           1.5         0.1
## 11           1.5         0.2
## 12           1.6         0.2
## 13           1.4         0.1
## 14           1.1         0.1
## 15           1.2         0.2
## 16           1.5         0.4
## 17           1.3         0.4
## 18           1.4         0.3
## 19           1.7         0.3
## 20           1.5         0.3
## 21           1.7         0.2
## 22           1.5         0.4
## 23           1.0         0.2
## 24           1.7         0.5
## 25           1.9         0.2
## 26           1.6         0.2
## 27           1.6         0.4
## 28           1.5         0.2
## 29           1.4         0.2
## 30           1.6         0.2
## 31           1.6         0.2
## 32           1.5         0.4
## 33           1.5         0.1
## 34           1.4         0.2
## 35           1.5         0.2
## 36           1.2         0.2
## 37           1.3         0.2
## 38           1.4         0.1
## 39           1.3         0.2
## 40           1.5         0.2
## 41           1.3         0.3
## 42           1.3         0.3
## 43           1.3         0.2
## 44           1.6         0.6
## 45           1.9         0.4
## 46           1.4         0.3
## 47           1.6         0.2
## 48           1.4         0.2
## 49           1.5         0.2
## 50           1.4         0.2
## 51           4.7         1.4
## 52           4.5         1.5
## 53           4.9         1.5
## 54           4.0         1.3
## 55           4.6         1.5
## 56           4.5         1.3
## 57           4.7         1.6
## 58           3.3         1.0
## 59           4.6         1.3
## 60           3.9         1.4
## 61           3.5         1.0
## 62           4.2         1.5
## 63           4.0         1.0
## 64           4.7         1.4
## 65           3.6         1.3
## 66           4.4         1.4
## 67           4.5         1.5
## 68           4.1         1.0
## 69           4.5         1.5
## 70           3.9         1.1
## 71           4.8         1.8
## 72           4.0         1.3
## 73           4.9         1.5
## 74           4.7         1.2
## 75           4.3         1.3
## 76           4.4         1.4
## 77           4.8         1.4
## 78           5.0         1.7
## 79           4.5         1.5
## 80           3.5         1.0
## 81           3.8         1.1
## 82           3.7         1.0
## 83           3.9         1.2
## 84           5.1         1.6
## 85           4.5         1.5
## 86           4.5         1.6
## 87           4.7         1.5
## 88           4.4         1.3
## 89           4.1         1.3
## 90           4.0         1.3
## 91           4.4         1.2
## 92           4.6         1.4
## 93           4.0         1.2
## 94           3.3         1.0
## 95           4.2         1.3
## 96           4.2         1.2
## 97           4.2         1.3
## 98           4.3         1.3
## 99           3.0         1.1
## 100          4.1         1.3
## 101          6.0         2.5
## 102          5.1         1.9
## 103          5.9         2.1
## 104          5.6         1.8
## 105          5.8         2.2
## 106          6.6         2.1
## 107          4.5         1.7
## 108          6.3         1.8
## 109          5.8         1.8
## 110          6.1         2.5
## 111          5.1         2.0
## 112          5.3         1.9
## 113          5.5         2.1
## 114          5.0         2.0
## 115          5.1         2.4
## 116          5.3         2.3
## 117          5.5         1.8
## 118          6.7         2.2
## 119          6.9         2.3
## 120          5.0         1.5
## 121          5.7         2.3
## 122          4.9         2.0
## 123          6.7         2.0
## 124          4.9         1.8
## 125          5.7         2.1
## 126          6.0         1.8
## 127          4.8         1.8
## 128          4.9         1.8
## 129          5.6         2.1
## 130          5.8         1.6
## 131          6.1         1.9
## 132          6.4         2.0
## 133          5.6         2.2
## 134          5.1         1.5
## 135          5.6         1.4
## 136          6.1         2.3
## 137          5.6         2.4
## 138          5.5         1.8
## 139          4.8         1.8
## 140          5.4         2.1
## 141          5.6         2.4
## 142          5.1         2.3
## 143          5.1         1.9
## 144          5.9         2.3
## 145          5.7         2.5
## 146          5.2         2.3
## 147          5.0         1.9
## 148          5.2         2.0
## 149          5.4         2.3
## 150          5.1         1.8

While there is no equivalent $ syntax to select a row by name, you can select specific rows using the row names (in this case just numbers).

rownames(iris)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
##  [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
##  [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
##  [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
##  [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
##  [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
##  [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
##  [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"
iris[c("100", "150"),]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 100          5.7         2.8          4.1         1.3 versicolor
## 150          5.9         3.0          5.1         1.8  virginica
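
Row names are more useful when they are descriptive. As an illustration outside the iris data, the built-in mtcars data set (also in the datasets package) uses car model names as its row names. The variable names below are just for illustration.

```r
# The first few row names of mtcars are car models
head(rownames(mtcars), 3)
## [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"

# Select a whole row by name ...
datsun <- mtcars["Datsun 710", ]

# ... or a single value by row name and column name
datsun_mpg <- mtcars["Datsun 710", "mpg"]
datsun_mpg
## [1] 22.8
```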

Subsetting data

Another way of partitioning dataframes is the subset() function, which returns the rows of the dataframe for which a logical expression is TRUE. This lets us subset the data in a single step. The syntax for the subset() function is:

subset(dataframe, column_name == "value")

Any logical expression could replace the column_name == "value" part. For example, we can look at the samples of the species setosa only:

subset(iris, Species == "setosa")
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa
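
The logical expression does not have to be an equality test. The sketch below shows a numeric comparison, two conditions joined with &, and the same setosa selection written with the square bracket notation from the previous section. The variable names are just for illustration.

```r
# Rows where the petal is longer than 6 (numeric comparison)
long_petals <- subset(iris, Petal.Length > 6)
nrow(long_petals)
## [1] 9

# Two conditions combined with & (both must be TRUE)
big_virginica <- subset(iris, Species == "virginica" & Petal.Width >= 2.5)
rownames(big_virginica)
## [1] "101" "110" "145"

# The setosa rows again, this time selected with square brackets
setosa_rows <- iris[iris$Species == "setosa", ]
nrow(setosa_rows)
## [1] 50
```

Note that inside subset() you can write the column name on its own, while inside the square brackets you need the full iris$Species.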

Question:

Look at the results of the following commands.

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
mean(subset(iris,Species == "setosa")$Petal.Width)
## [1] 0.246
mean(subset(iris,Species == "versicolor")$Petal.Width)
## [1] 1.326
mean(subset(iris,Species == "virginica")$Petal.Width)
## [1] 2.026
  1. Which species has the widest petals?
  2. Which species has the longest petals? Is it the same one? Edit the script to get it to give you the answer.
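
As an aside, the three mean() calls above repeat the same pattern once per species. Base R's tapply() function can apply a function to each group in one call. This is an optional shortcut, not something the rest of the tutorial depends on, and the variable name mean_width is just for illustration.

```r
# tapply(values, groups, function): mean petal width for every species
mean_width <- tapply(iris$Petal.Width, iris$Species, mean)
mean_width
##     setosa versicolor  virginica 
##      0.246      1.326      2.026
```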

Part 3: Species ranges from GBIF data

In this section, the examples will be done using the species Morpho menelaus, the blue morpho butterfly. Wherever you see its name, you’ll substitute your own species’ name.

Open a new script (File>New File>R Script). Hit Ctrl+S to save it with the name “species-range” or something like that. It will prompt you to save it in the test folder of your login, which is fine.

Downloading your data

You will use a function in the R package dismo to download your data. Paste the following into your new script file, substituting your species name, and run it by selecting it and hitting Ctrl+Enter or clicking “Run” in the top right corner of the script pane:

require(dismo)
## Loading required package: dismo
## Loading required package: raster
## Loading required package: sp
gbif('Morpho','menelaus',geo = FALSE, download = FALSE) #find out the total number of occurrences for this species in the database -- if this doesn't match the number of occurrences you see on the website, you should see if you typed something wrong!
## [1] 1869
raw_data <- gbif('Morpho','menelaus',geo = TRUE) #download all the occurrences with longitude and latitude data, which may not be all of them
## 1869 records found
## 0-300-600-900-1200-1500-1800-1869 records downloaded

Inspect the data you downloaded:

df <- raw_data #copy the GBIF download file into another data frame so we can start cleaning it
dim(df) #GBIF returns a LOT of columns!
## [1] 1869  175

Look at the column names:

colnames(df)
##   [1] "acceptedNameUsage"                    
##   [2] "acceptedScientificName"               
##   [3] "acceptedTaxonKey"                     
##   [4] "accessRights"                         
##   [5] "adm1"                                 
##   [6] "adm2"                                 
##   [7] "associatedReferences"                 
##   [8] "associatedSequences"                  
##   [9] "basisOfRecord"                        
##  [10] "behavior"                             
##  [11] "bibliographicCitation"                
##  [12] "catalogNumber"                        
##  [13] "class"                                
##  [14] "classKey"                             
##  [15] "cloc"                                 
##  [16] "collectionCode"                       
##  [17] "collectionID"                         
##  [18] "collectionKey"                        
##  [19] "continent"                            
##  [20] "coordinatePrecision"                  
##  [21] "coordinateUncertaintyInMeters"        
##  [22] "country"                              
##  [23] "crawlId"                              
##  [24] "dataGeneralizations"                  
##  [25] "datasetID"                            
##  [26] "datasetKey"                           
##  [27] "datasetName"                          
##  [28] "dateIdentified"                       
##  [29] "day"                                  
##  [30] "depth"                                
##  [31] "depthAccuracy"                        
##  [32] "disposition"                          
##  [33] "distanceFromCentroidInMeters"         
##  [34] "dynamicProperties"                    
##  [35] "elevation"                            
##  [36] "elevationAccuracy"                    
##  [37] "endDayOfYear"                         
##  [38] "establishmentMeans"                   
##  [39] "eventDate"                            
##  [40] "eventID"                              
##  [41] "eventRemarks"                         
##  [42] "eventTime"                            
##  [43] "eventType"                            
##  [44] "family"                               
##  [45] "familyKey"                            
##  [46] "fieldNotes"                           
##  [47] "fieldNumber"                          
##  [48] "footprintSRS"                         
##  [49] "footprintWKT"                         
##  [50] "fullCountry"                          
##  [51] "gbifID"                               
##  [52] "gbifRegion"                           
##  [53] "genericName"                          
##  [54] "genus"                                
##  [55] "genusKey"                             
##  [56] "geodeticDatum"                        
##  [57] "georeferencedBy"                      
##  [58] "georeferencedDate"                    
##  [59] "georeferenceProtocol"                 
##  [60] "georeferenceRemarks"                  
##  [61] "georeferenceSources"                  
##  [62] "georeferenceVerificationStatus"       
##  [63] "habitat"                              
##  [64] "higherClassification"                 
##  [65] "higherGeography"                      
##  [66] "higherGeographyID"                    
##  [67] "hostingOrganizationKey"               
##  [68] "http://unknown.org/captive_cultivated"
##  [69] "http://unknown.org/language"          
##  [70] "http://unknown.org/modified"          
##  [71] "http://unknown.org/nick"              
##  [72] "http://unknown.org/orders"            
##  [73] "http://unknown.org/recordEnteredBy"   
##  [74] "http://unknown.org/recordID"          
##  [75] "identificationID"                     
##  [76] "identificationReferences"             
##  [77] "identificationRemarks"                
##  [78] "identificationVerificationStatus"     
##  [79] "identifiedBy"                         
##  [80] "identifier"                           
##  [81] "individualCount"                      
##  [82] "informationWithheld"                  
##  [83] "infraspecificEpithet"                 
##  [84] "installationKey"                      
##  [85] "institutionCode"                      
##  [86] "institutionID"                        
##  [87] "institutionKey"                       
##  [88] "isInCluster"                          
##  [89] "ISO2"                                 
##  [90] "isSequenced"                          
##  [91] "iucnRedListCategory"                  
##  [92] "key"                                  
##  [93] "kingdom"                              
##  [94] "kingdomKey"                           
##  [95] "language"                             
##  [96] "lastCrawled"                          
##  [97] "lastInterpreted"                      
##  [98] "lastParsed"                           
##  [99] "lat"                                  
## [100] "license"                              
## [101] "lifeStage"                            
## [102] "locality"                             
## [103] "locationID"                           
## [104] "locationRemarks"                      
## [105] "lon"                                  
## [106] "materialEntityID"                     
## [107] "modified"                             
## [108] "month"                                
## [109] "municipality"                         
## [110] "nameAccordingTo"                      
## [111] "nomenclaturalCode"                    
## [112] "occurrenceID"                         
## [113] "occurrenceRemarks"                    
## [114] "occurrenceStatus"                     
## [115] "order"                                
## [116] "orderKey"                             
## [117] "organismID"                           
## [118] "organismQuantity"                     
## [119] "organismQuantityType"                 
## [120] "originalNameUsage"                    
## [121] "otherCatalogNumbers"                  
## [122] "ownerInstitutionCode"                 
## [123] "parentNameUsage"                      
## [124] "phylum"                               
## [125] "phylumKey"                            
## [126] "preparations"                         
## [127] "previousIdentifications"              
## [128] "programmeAcronym"                     
## [129] "projectId"                            
## [130] "protocol"                             
## [131] "publishedByGbifRegion"                
## [132] "publishingCountry"                    
## [133] "publishingOrgKey"                     
## [134] "recordedBy"                           
## [135] "recordNumber"                         
## [136] "references"                           
## [137] "reproductiveCondition"                
## [138] "rights"                               
## [139] "rightsHolder"                         
## [140] "sampleSizeUnit"                       
## [141] "sampleSizeValue"                      
## [142] "samplingEffort"                       
## [143] "samplingProtocol"                     
## [144] "scientificName"                       
## [145] "scientificNameID"                     
## [146] "sex"                                  
## [147] "species"                              
## [148] "speciesKey"                           
## [149] "specificEpithet"                      
## [150] "startDayOfYear"                       
## [151] "subfamily"                            
## [152] "superfamily"                          
## [153] "taxonConceptID"                       
## [154] "taxonID"                              
## [155] "taxonKey"                             
## [156] "taxonomicStatus"                      
## [157] "taxonRank"                            
## [158] "taxonRemarks"                         
## [159] "tribe"                                
## [160] "type"                                 
## [161] "typeStatus"                           
## [162] "typifiedName"                         
## [163] "verbatimCoordinateSystem"             
## [164] "verbatimElevation"                    
## [165] "verbatimEventDate"                    
## [166] "verbatimIdentification"               
## [167] "verbatimLabel"                        
## [168] "verbatimLocality"                     
## [169] "verbatimSRS"                          
## [170] "verbatimTaxonRank"                    
## [171] "vernacularName"                       
## [172] "vitality"                             
## [173] "waterBody"                            
## [174] "year"                                 
## [175] "downloadDate"

Look at some of the fields for the first six rows:

head(df)[,c("species","continent","country","adm1","lat","lon")]
##           species     continent    country           adm1        lat       lon
## 1 Morpho menelaus SOUTH_AMERICA     Brazil Rio de Janeiro -22.421437 -42.72357
## 2 Morpho menelaus NORTH_AMERICA Costa Rica     Puntarenas   8.619720 -83.47618
## 3 Morpho menelaus SOUTH_AMERICA    Ecuador           Napo  -0.946811 -77.86990
## 4 Morpho menelaus SOUTH_AMERICA     Brazil       Rondônia  -9.879575 -62.83085
## 5 Morpho menelaus SOUTH_AMERICA       Peru  Madre de Dios -12.225983 -69.11453
## 6 Morpho menelaus SOUTH_AMERICA     Brazil Espírito Santo -19.066258 -40.14829

Data cleaning

First we’re going to remove all the occurrences that don’t have latitude and longitude data.

df  <-  subset(df,!is.na(df$lon) & !is.na(df$lat))
nrow(df) #how many data points do we have now?
## [1] 1035

Then we transform all the negative longitude values so that the range goes from 0 to 360 instead of from -180 to 180. This will allow us to plot the points on the “world2” map used below, which is drawn on the 0 to 360 scale. We will add this as an extra column in “df” so that we can use either version.

westlongitudes  <-  which(df$lon < 0)
df[,"lon360"]  <- df[,"lon"] 
df[westlongitudes,"lon360"]  <-  360 + df[westlongitudes,"lon"] 
#Do you understand how these two lines work?
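
If it helps, here is the same idea on a small made-up vector of longitudes instead of a data frame column. The values and the names lon, west and lon360_b are only for illustration; this is a sketch of the logic, not part of the analysis.

```r
# Toy longitudes in the -180..180 convention (made-up values)
lon <- c(-42.7, -83.5, 4.4, 151.2)

# The same steps as above, on a plain vector
west <- which(lon < 0)            # positions of the negative (western) values
lon360 <- lon                     # start with a copy of the original values
lon360[west] <- 360 + lon[west]   # shift only the negative values

lon360
# 317.3 276.5 4.4 151.2

# An equivalent one-line version using ifelse()
lon360_b <- ifelse(lon < 0, lon + 360, lon)
```

Positive longitudes are left alone; only the western values move, so -42.7 becomes 317.3.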

Next, we make a simple map to look for errors:

require(maps) #load the mapping library
## Loading required package: maps
map("world2",col = "green") #generate the map
map.axes() #label the axes (longitude and latitude) 
points(df$lon360,df$lat,col = "red",pch = 20) #plot the species occurrence points

This will be easier to read if we make it so the map shows only the part of the Earth where GBIF has occurrence records for our species.

map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), #one extra degree on each side for visibility
     ylim = range(df$lat,na.rm = T) + c(-1,1)) 
points(df$lon360,df$lat,col = "red",pch = 20)
map.axes()

In the example of the blue morpho butterflies, you can see that almost all the occurrences are from the tropical parts of South and Central America, but there are also a few others in Europe and Oceania. This /could/ be because GBIF keeps track of museum specimens as well. What kinds of data points are included in your data set, and how many of them?

table(df[,"basisOfRecord"])
## 
##  HUMAN_OBSERVATION    MATERIAL_SAMPLE         OCCURRENCE PRESERVED_SPECIMEN 
##                539                 16                 24                456

But we don’t want those – we’re trying to look at the actual habitat range of the living species. We should make sure we’re only dealing with live observations, not with fossil or preserved specimens. Which points are /not/ from observations of living animals?

notobs <- which(!(df$basisOfRecord == "HUMAN_OBSERVATION" | df$basisOfRecord == "OBSERVATION" | df$basisOfRecord == "OCCURRENCE"))
map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1)) 
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[notobs,]$lon360,df[notobs,]$lat,col = "black",pch = 21)
map.axes()

The points that aren’t from actual observations of living butterflies are circled in black. We’ll remove these points:

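remove <- notobs
if (length(remove) > 0) df <- df[-remove,] #remove the incorrect points (skipped if nothing was flagged, because df[-integer(0),] would drop every row)
rm(remove)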
nrow(df) #how many left now? 
## [1] 563
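
Another way to write this kind of filter, sketched on a made-up data frame called toy (not your real data), is to build a TRUE/FALSE vector with %in% and keep the rows where it is TRUE. The second half shows why removing rows by negative row numbers needs care when nothing has been flagged.

```r
# Made-up records for illustration only
toy <- data.frame(
  id = 1:5,
  basisOfRecord = c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN", "OCCURRENCE",
                    "MATERIAL_SAMPLE", "HUMAN_OBSERVATION")
)

# Record types we treat as observations of living animals
live_types <- c("HUMAN_OBSERVATION", "OBSERVATION", "OCCURRENCE")

keep <- toy$basisOfRecord %in% live_types  # TRUE where the type is in the list
toy_live <- toy[keep, ]
nrow(toy_live)
# 3

# Negative indexing with an empty set of row numbers drops everything
nothing_flagged <- which(toy$id > 100)  # integer(0): no rows match
nrow(toy[-nothing_flagged, ])
# 0
nrow(toy[!(toy$id > 100), ])            # a logical filter keeps all 5 rows
# 5
```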

Plot what’s left to make sure there’s nothing else that stands out as probably incorrect:

map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20) #plot again with only the real data

Recall what you know about your species’ range. Do any of these occurrences look like they might be errors?

Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. These warehouses do not gather data specifically for the purpose of species distribution modeling, so you need to understand the data and clean them appropriately for your application.

My example species, the blue morpho butterfly Morpho menelaus, lives in South and Central American tropical rainforests. The points in Northern Europe seem pretty suspicious, on that basis; maybe they’re tagged incorrectly, and are actually captive individuals in a zoo, or even dead preserved specimens? If you have data points in suspicious locations, take a look at them by filtering the latitude or longitude:

test1  <-  which(df$lon360 < 250) #flag all points outside the band from 110°W to 0° longitude, which contains Central and South America; adjust the cutoff for your species
map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[test1,]$lon360,df[test1,]$lat,col = "black",pch = 21) #circle the flagged points in black

All the points we’ve identified as being in the wrong place are now circled in black. What can we find out about them?

df[test1,]
##     acceptedNameUsage           acceptedScientificName acceptedTaxonKey
## 115              <NA> Morpho menelaus (Linnaeus, 1758)          5133523
## 313              <NA> Morpho menelaus (Linnaeus, 1758)          5133523
## 314              <NA> Morpho menelaus (Linnaeus, 1758)          5133523
##                                         accessRights    adm1 adm2
## 115 http://www.natuurpunt.be/normen-voor-datagebruik Antwerp <NA>
## 313 http://www.natuurpunt.be/normen-voor-datagebruik Antwerp <NA>
## 314 http://www.natuurpunt.be/normen-voor-datagebruik Antwerp <NA>
##     associatedReferences associatedSequences     basisOfRecord behavior
## 115                 <NA>                <NA> HUMAN_OBSERVATION     <NA>
## 313                 <NA>                <NA> HUMAN_OBSERVATION     <NA>
## 314                 <NA>                <NA> HUMAN_OBSERVATION     <NA>
##     bibliographicCitation catalogNumber   class classKey
## 115                  <NA>          <NA> Insecta      216
## 313                  <NA>          <NA> Insecta      216
## 314                  <NA>          <NA> Insecta      216
##                         cloc  collectionCode collectionID collectionKey
## 115 Antwerp, Belgium, EUROPE waarnemingen.be         <NA>          <NA>
## 313 Antwerp, Belgium, EUROPE waarnemingen.be         <NA>          <NA>
## 314 Antwerp, Belgium, EUROPE waarnemingen.be         <NA>          <NA>
##     continent coordinatePrecision coordinateUncertaintyInMeters country crawlId
## 115    EUROPE                  NA                             2 Belgium     431
## 313    EUROPE                  NA                            25 Belgium     431
## 314    EUROPE                  NA                            25 Belgium     431
##     dataGeneralizations                       datasetID
## 115                <NA> https://doi.org/10.15468/k2aiak
## 313                <NA> https://doi.org/10.15468/k2aiak
## 314                <NA> https://doi.org/10.15468/k2aiak
##                               datasetKey
## 115 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
## 313 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
## 314 9a0b66df-7535-4f28-9f4e-5bc11b8b096c
##                                                                                              datasetName
## 115 Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium
## 313 Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium
## 314 Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium
##     dateIdentified day depth depthAccuracy disposition
## 115           <NA>   5    NA            NA        <NA>
## 313           <NA>  26    NA            NA        <NA>
## 314           <NA>  26    NA            NA        <NA>
##     distanceFromCentroidInMeters dynamicProperties elevation elevationAccuracy
## 115                           NA              <NA>        NA                NA
## 313                           NA              <NA>        NA                NA
## 314                           NA              <NA>        NA                NA
##     endDayOfYear establishmentMeans  eventDate eventID eventRemarks eventTime
## 115          249               <NA> 2024-09-05    <NA>         <NA>      <NA>
## 313          238               <NA> 2019-08-26    <NA>         <NA>      <NA>
## 314          238               <NA> 2019-08-26    <NA>         <NA>      <NA>
##     eventType      family familyKey fieldNotes fieldNumber footprintSRS
## 115      <NA> Nymphalidae      7017       <NA>        <NA>         <NA>
## 313      <NA> Nymphalidae      7017       <NA>        <NA>         <NA>
## 314      <NA> Nymphalidae      7017       <NA>        <NA>         <NA>
##     footprintWKT fullCountry     gbifID gbifRegion genericName  genus genusKey
## 115         <NA>     Belgium 4930264131     EUROPE      Morpho Morpho  1909888
## 313         <NA>     Belgium 2573319358     EUROPE      Morpho Morpho  1909888
## 314         <NA>     Belgium 2573319423     EUROPE      Morpho Morpho  1909888
##     geodeticDatum georeferencedBy georeferencedDate georeferenceProtocol
## 115         WGS84            <NA>              <NA>                 <NA>
## 313         WGS84            <NA>              <NA>                 <NA>
## 314         WGS84            <NA>              <NA>                 <NA>
##     georeferenceRemarks georeferenceSources georeferenceVerificationStatus
## 115                <NA>                <NA>                           <NA>
## 313                <NA>                <NA>                           <NA>
## 314                <NA>                <NA>                           <NA>
##     habitat higherClassification higherGeography higherGeographyID
## 115    <NA>                 <NA>            <NA>              <NA>
## 313    <NA>                 <NA>            <NA>              <NA>
## 314    <NA>                 <NA>            <NA>              <NA>
##                   hostingOrganizationKey http://unknown.org/captive_cultivated
## 115 1cd669d0-80ea-11de-a9d0-f1765f95f18b                                  <NA>
## 313 1cd669d0-80ea-11de-a9d0-f1765f95f18b                                  <NA>
## 314 1cd669d0-80ea-11de-a9d0-f1765f95f18b                                  <NA>
##     http://unknown.org/language http://unknown.org/modified
## 115                        <NA>                        <NA>
## 313                        <NA>                        <NA>
## 314                        <NA>                        <NA>
##     http://unknown.org/nick http://unknown.org/orders
## 115                    <NA>                      <NA>
## 313                    <NA>                      <NA>
## 314                    <NA>                      <NA>
##     http://unknown.org/recordEnteredBy http://unknown.org/recordID
## 115                               <NA>                        <NA>
## 313                               <NA>                        <NA>
## 314                               <NA>                        <NA>
##     identificationID identificationReferences identificationRemarks
## 115             <NA>                     <NA>                  <NA>
## 313             <NA>                     <NA>                  <NA>
## 314             <NA>                     <NA>                  <NA>
##      identificationVerificationStatus identifiedBy
## 115                        unverified         <NA>
## 313 approved on photographic evidence         <NA>
## 314 approved on photographic evidence         <NA>
##                            identifier individualCount informationWithheld
## 115 Natuurpunt:Waarnemingen:327260672               1        see metadata
## 313 Natuurpunt:Waarnemingen:185072398               1        see metadata
## 314 Natuurpunt:Waarnemingen:185213553               1        see metadata
##     infraspecificEpithet                      installationKey institutionCode
## 115                 <NA> 9f25fd85-85dc-4dcd-a1b4-b31165442e2b      Natuurpunt
## 313                 <NA> 9f25fd85-85dc-4dcd-a1b4-b31165442e2b      Natuurpunt
## 314                 <NA> 9f25fd85-85dc-4dcd-a1b4-b31165442e2b      Natuurpunt
##     institutionID institutionKey isInCluster ISO2 isSequenced
## 115          <NA>           <NA>       FALSE   BE       FALSE
## 313          <NA>           <NA>       FALSE   BE       FALSE
## 314          <NA>           <NA>       FALSE   BE       FALSE
##     iucnRedListCategory        key  kingdom kingdomKey language
## 115                  NE 4930264131 Animalia          1       en
## 313                  NE 2573319358 Animalia          1       en
## 314                  NE 2573319423 Animalia          1       en
##                       lastCrawled               lastInterpreted
## 115 2025-06-11T10:03:35.552+00:00 2025-06-11T10:20:46.497+00:00
## 313 2025-06-11T10:03:35.552+00:00 2025-06-11T10:22:25.631+00:00
## 314 2025-06-11T10:03:35.552+00:00 2025-06-11T10:22:25.653+00:00
##                        lastParsed      lat
## 115 2025-06-11T10:20:46.497+00:00 51.21534
## 313 2025-06-11T10:22:25.631+00:00 51.21180
## 314 2025-06-11T10:22:25.653+00:00 51.21180
##                                                     license lifeStage locality
## 115 http://creativecommons.org/licenses/by-nc/4.0/legalcode     Adult     <NA>
## 313 http://creativecommons.org/licenses/by-nc/4.0/legalcode     Adult     <NA>
## 314 http://creativecommons.org/licenses/by-nc/4.0/legalcode     Adult     <NA>
##     locationID locationRemarks     lon materialEntityID modified month
## 115       <NA>            <NA> 4.42175             <NA>     <NA>     9
## 313       <NA>            <NA> 4.41615             <NA>     <NA>     8
## 314       <NA>            <NA> 4.41615             <NA>     <NA>     8
##     municipality nameAccordingTo nomenclaturalCode
## 115    Antwerpen            <NA>              ICZN
## 313    Antwerpen            <NA>              ICZN
## 314    Antwerpen            <NA>              ICZN
##                          occurrenceID occurrenceRemarks occurrenceStatus
## 115 Natuurpunt:Waarnemingen:327260672              <NA>          PRESENT
## 313 Natuurpunt:Waarnemingen:185072398              <NA>          PRESENT
## 314 Natuurpunt:Waarnemingen:185213553              <NA>          PRESENT
##           order orderKey organismID organismQuantity organismQuantityType
## 115 Lepidoptera      797       <NA>               NA                 <NA>
## 313 Lepidoptera      797       <NA>               NA                 <NA>
## 314 Lepidoptera      797       <NA>               NA                 <NA>
##     originalNameUsage otherCatalogNumbers ownerInstitutionCode parentNameUsage
## 115              <NA>                <NA>                 <NA>            <NA>
## 313              <NA>                <NA>                 <NA>            <NA>
## 314              <NA>                <NA>                 <NA>            <NA>
##         phylum phylumKey preparations previousIdentifications programmeAcronym
## 115 Arthropoda        54         <NA>                    <NA>             <NA>
## 313 Arthropoda        54         <NA>                    <NA>             <NA>
## 314 Arthropoda        54         <NA>                    <NA>             <NA>
##     projectId protocol publishedByGbifRegion publishingCountry
## 115      <NA>      EML                EUROPE                BE
## 313      <NA>      EML                EUROPE                BE
## 314      <NA>      EML                EUROPE                BE
##                         publishingOrgKey recordedBy recordNumber
## 115 4d3ceea8-5699-439d-a899-decac9cbbdac       <NA>         <NA>
## 313 4d3ceea8-5699-439d-a899-decac9cbbdac       <NA>         <NA>
## 314 4d3ceea8-5699-439d-a899-decac9cbbdac       <NA>         <NA>
##                                        references reproductiveCondition rights
## 115 https://waarnemingen.be/observation/327260672                  <NA>   <NA>
## 313 https://waarnemingen.be/observation/185072398                  <NA>   <NA>
## 314 https://waarnemingen.be/observation/185213553                  <NA>   <NA>
##          rightsHolder sampleSizeUnit sampleSizeValue samplingEffort
## 115 Natuurpunt Studie           <NA>              NA           <NA>
## 313 Natuurpunt Studie           <NA>              NA           <NA>
## 314 Natuurpunt Studie           <NA>              NA           <NA>
##     samplingProtocol                   scientificName scientificNameID  sex
## 115             seen Morpho menelaus (Linnaeus, 1758)             <NA> <NA>
## 313             seen Morpho menelaus (Linnaeus, 1758)             <NA> <NA>
## 314             seen Morpho menelaus (Linnaeus, 1758)             <NA> <NA>
##             species speciesKey specificEpithet startDayOfYear subfamily
## 115 Morpho menelaus    5133523        menelaus            249      <NA>
## 313 Morpho menelaus    5133523        menelaus            238      <NA>
## 314 Morpho menelaus    5133523        menelaus            238      <NA>
##     superfamily taxonConceptID                                taxonID taxonKey
## 115        <NA>           <NA> https://waarnemingen.be/species/237778  5133523
## 313        <NA>           <NA> https://waarnemingen.be/species/237778  5133523
## 314        <NA>           <NA> https://waarnemingen.be/species/237778  5133523
##     taxonomicStatus taxonRank taxonRemarks tribe  type typeStatus typifiedName
## 115        ACCEPTED   SPECIES         <NA>  <NA> Event       <NA>         <NA>
## 313        ACCEPTED   SPECIES         <NA>  <NA> Event       <NA>         <NA>
## 314        ACCEPTED   SPECIES         <NA>  <NA> Event       <NA>         <NA>
##     verbatimCoordinateSystem verbatimElevation verbatimEventDate
## 115                     <NA>              <NA>              <NA>
## 313                     <NA>              <NA>              <NA>
## 314                     <NA>              <NA>              <NA>
##     verbatimIdentification verbatimLabel verbatimLocality verbatimSRS
## 115                   <NA>          <NA>             <NA>        <NA>
## 313                   <NA>          <NA>             <NA>        <NA>
## 314                   <NA>          <NA>             <NA>        <NA>
##     verbatimTaxonRank vernacularName vitality waterBody year downloadDate
## 115              <NA>           <NA>     <NA>      <NA> 2024   2025-06-17
## 313              <NA>           <NA>     <NA>      <NA> 2019   2025-06-17
## 314              <NA>           <NA>     <NA>      <NA> 2019   2025-06-17
##      lon360
## 115 4.42175
## 313 4.41615
## 314 4.41615
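
A printout of all 175 columns is hard to scan. A lighter option is to ask only for the columns you care about. The sketch below uses a small made-up data frame called toy so that it runs on its own; with the real data you might instead try something like df[test1, c("country","adm1","basisOfRecord","datasetName","references")], since all of those columns appear in the list of names above.

```r
# Toy stand-in for df (a few values borrowed from the tables above)
toy <- data.frame(
  species = "Morpho menelaus",
  country = c("Ecuador", "Belgium", "Peru", "Belgium"),
  adm1 = c("Napo", "Antwerp", "Madre de Dios", "Antwerp"),
  basisOfRecord = "HUMAN_OBSERVATION",
  lat = c(-0.946811, 51.21534, -12.225983, 51.21180),
  lon360 = c(282.1301, 4.42175, 290.88547, 4.41615)
)

flagged <- which(toy$lon360 < 250)                     # same test as test1 above
cols <- c("country", "adm1", "basisOfRecord", "lat", "lon360")
toy[flagged, cols]                                     # only flagged rows, only chosen columns
```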

These butterflies are in Antwerp, in Belgium, where there is a very famous zoo – and when I search for information about it, it appears it has a butterfly garden! I suspect these are captive specimens, so I want to exclude them from my data set.

remove  <-  c(test1)
if (length(remove) > 0) df <- df[-remove,] #remove the incorrect points (skipped if nothing was flagged)
rm(remove)

What’s left?

map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)

Those all look like reasonable places for blue morphos to live. Keep cleaning yours until you’ve gotten rid of any other data points that make no sense.
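
One further check you might try is looking for repeated records. In the Antwerp printout above, rows 313 and 314 share the same coordinates and date. The sketch below uses made-up data (the last row's date is invented), and whether such rows count as true duplicates is a judgment call that depends on your species and data source.

```r
# Made-up records; rows 2 and 3 have identical coordinates and date
toy <- data.frame(
  lat = c(51.21534, 51.21180, 51.21180, -12.225983),
  lon = c(4.42175, 4.41615, 4.41615, -69.11453),
  eventDate = c("2024-09-05", "2019-08-26", "2019-08-26", "2021-01-15")
)

# duplicated() marks every row that repeats an earlier row
dups <- duplicated(toy[, c("lat", "lon", "eventDate")])
which(dups)
# 3

toy_unique <- toy[!dups, ]  # keep the first copy of each record
nrow(toy_unique)
# 3
```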

In a longer-term research project intended for publication, you would spend a lot more time on the data cleaning step, and indeed there are programs and functions for doing exactly that, but for today let’s leave it here.

Mapping species range

Now, how should we visualize the species range? We’ll start by drawing a polygon that encloses all the points (this is called a “hull”).

require(sf); require(concaveman) #load mapping libraries
## Loading required package: sf
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
## Loading required package: concaveman
sfdata <- st_as_sf(df,coords = c("lon360","lat")) #this reformats the coordinate points into a special data structure

conc <- concaveman(sfdata,concavity = 3,length_threshold = 0) #this is called a concave hull, it's a polygon that contains all the points

conv <- convHull(df[,c("lon360","lat")]) #this is called a convex hull, it's just a polygon drawn around all the points that stick out the most

Then make a map that shows the concave and convex hulls:

map("world2",col = "green",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
plot(conv,add = T,col = rgb(1,1,0,0.3),lty = "blank")
plot(conc,add = T,col = rgb(1,0,0,0.3),lty = "blank")
legend("topright",col = c(rgb(1,1,0,0.3),rgb(1,0,0,0.3)),
       legend = c("convex","concave"),pch = 15,bty = "n")

This isn’t very satisfactory as a map of species range, as it doesn’t take any notice of whether your species could actually live in all the places in between the points you plotted. In the next part we’ll look at some environmental data to see if we can figure out a better way.
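
To see what a convex hull is on a small scale, here is a sketch with seven made-up points, using chull() from base R rather than the function used above. The four corner points end up on the hull and the three interior points do not.

```r
# Seven made-up points: four corners of a square and three points inside it
pts <- data.frame(x = c(0, 4, 4, 0, 2, 1, 3),
                  y = c(0, 0, 4, 4, 2, 1, 3))

# chull() returns the row numbers of the points that lie on the convex hull
hull_idx <- chull(pts$x, pts$y)
sort(hull_idx)
# 1 2 3 4

# Draw the points and shade the hull
plot(pts$x, pts$y, pch = 20, col = "red", asp = 1)
polygon(pts$x[hull_idx], pts$y[hull_idx], col = rgb(1, 1, 0, 0.3), border = NA)
```

The shaded square covers the whole area between the corners, whether or not any points fall there, which is the same weakness the hulls on the species map have.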

To save your map to your class file, click Export>Save as Image. Give it a name that contains the species name and your name.

Part 4: Environmental data

Download the climatic data from the WorldClim website.

require(geodata); require(raster);require(here)
## Loading required package: geodata
## Loading required package: terra
## terra 1.8.54
## Loading required package: here
## here() starts at /Users/jblois/Documents/GitHub/biodata_shortcourse/development
climate <- worldclim_global(var = 'bio',res = 2.5,path = here())
climate <- stack(climate)

The variable climate now contains a special data structure called a “RasterStack”, which consists of some number of matrices of exactly the same dimensions. (Think of it like a neatly aligned stack of maps.)
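
As a rough analogy only, here is a base R array built from two tiny made-up matrices. A real RasterStack also stores the coordinates, extent and map projection of its cells, which this toy version does not; the name toy_stack and all values are invented.

```r
# Two made-up "maps" with the same dimensions (2 rows x 3 columns)
layer1 <- matrix(1:6, nrow = 2)
layer2 <- matrix(seq(10, 60, by = 10), nrow = 2)

# Stack them into a 3-dimensional array: rows x columns x layers
toy_stack <- array(c(layer1, layer2),
                   dim = c(2, 3, 2),
                   dimnames = list(NULL, NULL, c("bio1", "bio2")))

dim(toy_stack)
# 2 3 2

# The values of every layer at one cell (row 1, column 2)
toy_stack[1, 2, ]
# bio1 = 3, bio2 = 30
```

Because the layers line up exactly, one location gives you one value from each layer.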

names(climate) #these names are annoyingly long, let's rename them
##  [1] "wc2.1_2.5m_bio_1"  "wc2.1_2.5m_bio_2"  "wc2.1_2.5m_bio_3" 
##  [4] "wc2.1_2.5m_bio_4"  "wc2.1_2.5m_bio_5"  "wc2.1_2.5m_bio_6" 
##  [7] "wc2.1_2.5m_bio_7"  "wc2.1_2.5m_bio_8"  "wc2.1_2.5m_bio_9" 
## [10] "wc2.1_2.5m_bio_10" "wc2.1_2.5m_bio_11" "wc2.1_2.5m_bio_12"
## [13] "wc2.1_2.5m_bio_13" "wc2.1_2.5m_bio_14" "wc2.1_2.5m_bio_15"
## [16] "wc2.1_2.5m_bio_16" "wc2.1_2.5m_bio_17" "wc2.1_2.5m_bio_18"
## [19] "wc2.1_2.5m_bio_19"
names(climate) <- paste0("bio",1:19) #paste0() works on the whole vector 1:19 at once
names(climate)
##  [1] "bio1"  "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8"  "bio9" 
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"

In the case of this climate data file that we just downloaded, those maps contain the values of 19 different climatic variables that are frequently relevant to species distributions, for all the land surface in the whole world (not the oceans).

Viewing the environmental data

You can plot any one of the layers to have a look at it. Call it by its name, using the $ operator, as an argument to the plot() function.

plot(climate$bio1)

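This layer, bio1, is the annual mean temperature. To see what each of the 19 bioclimatic variables means, look at https://www.worldclim.org/data/bioclim.html. In WorldClim version 2.1, which is what we downloaded here (note the “wc2.1” prefix in the original layer names), temperatures are given in degrees Celsius; precipitation is in millimeters. (The older version 1.4 stored temperatures in tenths of a degree Celsius.)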

Then you can plot your own species occurrence data on top of it, restricting the range of the map to the range of your occurrences plus 1 degree in each direction, the same way we did in part 3. The climate data layers report longitude as going from -180 to 180, so we have to go back to the original longitude column (“lon”, not “lon360”):

plot(climate$bio1,
     xlim = range(df$lon,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
points(df$lon,df$lat,col = "red",pch = 20)


Thinking about it

Try this with each of the bioclimatic data layers in climate (bio1 through bio19).

  • Do any of the bioclimatic variables seem to be important in controlling the range of your species?

  • If so, which ones? Save the images to your class folder for later reference.

  • What do you think about this?

    • Are you surprised by the results?
    • Can you think of a reason why these particular climatic variables might have a lot to do with the possible range of your species?
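
If retyping the plot command 19 times gets tedious, a loop can do it for you. The sketch below wraps the plotting code from above in a function; plot_all_layers and layer_names are made-up names, and the function assumes that its first argument is the renamed RasterStack and its second is a data frame with “lon” and “lat” columns.

```r
# The 19 layer names, bio1 through bio19
layer_names <- paste0("bio", 1:19)

# Hypothetical helper: draws one map per layer with the occurrences on top
plot_all_layers <- function(climate, df, layers = layer_names) {
  xl <- range(df$lon, na.rm = TRUE) + c(-1, 1)  # one extra degree on each side
  yl <- range(df$lat, na.rm = TRUE) + c(-1, 1)
  for (v in layers) {
    plot(climate[[v]], xlim = xl, ylim = yl, main = v)  # v becomes the plot title
    points(df$lon, df$lat, col = "red", pch = 20)
  }
}

# Usage, once climate and df exist in your session:
# plot_all_layers(climate, df)
```

In RStudio the maps will appear one after another in the Plots panel; use the arrow buttons there to step back through them.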

Tomorrow we’ll do a quantitative model with these data to answer the same question!

Saving your data

Save your data so you can load it again tomorrow. This is not straightforward on UC Merced computer lab computers, so please follow ALL of the following steps:

  1. Choose Session>Save Workspace As…
  2. In the popup window, choose Documents from the list on the left side under Quick Access.
  3. Give the file a UNIQUE name with YOUR NAME in it and click Save.

Your instructor will make sure these files are here for you to load tomorrow morning.